24 research outputs found

    Distributed Many-to-Many Protein Sequence Alignment using Sparse Matrices

    Identifying similar protein sequences is a core step in many computational biology pipelines such as detection of homologous protein sequences, generation of similarity protein graphs for downstream analysis, functional annotation and gene location. Performance and scalability of protein similarity searches have proven to be a bottleneck in many bioinformatics pipelines due to increases in cheap and abundant sequencing data. This work presents a new distributed-memory software package, PASTIS. PASTIS relies on sparse matrix computations for efficient identification of possibly similar proteins. We use distributed sparse matrices for scalability and show that the sparse matrix infrastructure is a great fit for protein similarity searches when coupled with a fully-distributed dictionary of sequences that allows remote sequence requests to be fulfilled. Our algorithm incorporates the unique bias in amino acid sequence substitution in searches without altering the basic sparse matrix model, and in turn, achieves ideal scaling up to millions of protein sequences. Comment: To appear in International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'20)
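
As a rough illustration of how sparse matrix computation can surface possibly similar sequences, the sketch below treats the input as a sparse sequence-by-k-mer matrix A and computes the nonzero off-diagonal entries of A times its transpose, which count the k-mers two sequences share. This is not the PASTIS implementation: the function names, the k-mer length, the `min_shared` threshold, and the single-process dictionary representation are all assumptions made for this example, and the substitution-aware matching mentioned in the abstract is not modeled.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_counts(seqs, k=3):
    """Nonzero off-diagonal entries of A @ A.T for a 0/1 sequence-by-k-mer matrix A.

    The matrix is stored column-wise: each k-mer maps to the rows (sequences)
    that contain it. Every pair of rows in a column adds 1 to that pair's entry.
    """
    columns = defaultdict(list)
    for row, seq in enumerate(seqs):
        for kmer in kmers(seq, k):
            columns[kmer].append(row)

    overlap = defaultdict(int)
    for rows in columns.values():
        for i, j in combinations(sorted(rows), 2):
            overlap[(i, j)] += 1
    return dict(overlap)


def candidate_pairs(seqs, k=3, min_shared=2):
    """Sequence pairs sharing at least min_shared k-mers (hypothetical threshold)."""
    counts = shared_kmer_counts(seqs, k)
    return sorted(pair for pair, c in counts.items() if c >= min_shared)
```

Only the pairs returned by `candidate_pairs` would then need a full alignment, which is where the saving over all-against-all comparison comes from.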

    Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets

    Background: Modern pyrosequencing techniques make it possible to study complex bacterial populations, such as 16S rRNA, directly from environmental or clinical samples without the need for laboratory purification. Alignment of sequences across the resultant large data sets (100,000+ sequences) is of particular interest for the purpose of identifying potential gene clusters and families, but such analysis represents a daunting computational task. The aim of this work is the development of an efficient pipeline for the clustering of large sequence read sets. Methods: Pairwise alignment techniques are used here to calculate genetic distances between sequence pairs. These methods are pleasingly parallel and have been shown to reflect genetic distances in highly variable regions of rRNA genes more accurately than traditional multiple sequence alignment (MSA) approaches. By utilizing Needleman-Wunsch (NW) pairwise alignment in conjunction with novel implementations of interpolative multidimensional scaling (MDS), we have developed an effective method for visualizing massive biosequence data sets and quickly identifying potential gene clusters. Results: This study demonstrates the use of interpolative MDS to obtain clustering results that are qualitatively similar to those obtained through full MDS, but with substantial cost savings. In particular, the wall clock time required to cluster a set of 100,000 sequences has been reduced from seven hours to less than one hour through the use of interpolative MDS. Conclusions: Although work remains to be done in selecting the optimal training set size for interpolative MDS, substantial computational cost savings will allow us to cluster much larger sequence sets in the future.
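
For readers unfamiliar with Needleman-Wunsch alignment, the following minimal sketch computes the global alignment score of two sequences by dynamic programming and then scores every pair in a set. The scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions chosen for illustration, not the parameters used in the paper, and the conversion from alignment to genetic distance is not shown.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score, keeping only two DP rows."""
    # Row 0: aligning the empty prefix of a against each prefix of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


def pairwise_scores(seqs):
    """Score every unordered pair (i, j) with i < j; each pair is independent."""
    return {
        (i, j): nw_score(seqs[i], seqs[j])
        for i in range(len(seqs))
        for j in range(i + 1, len(seqs))
    }
```

Because no pair depends on any other, the loop in `pairwise_scores` can be split across workers without coordination, which is what makes the approach pleasingly parallel.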

    The Parallelism Motifs of Genomic Data Analysis

    Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or motifs that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing.
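
To make the hashing and sorting patterns concrete, here is a small single-process sketch of k-mer counting, a common building block in genomic analysis: a hash table accumulates counts, and a sort orders the k-mers by frequency. The function names and the choice of k-mer counting as the example are assumptions for illustration; the paper's own kernels and their distributed implementations are not reproduced here.

```python
from collections import Counter


def count_kmers(reads, k):
    """Hashing: accumulate the count of every length-k substring in a hash table."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def sorted_kmers(counts):
    """Sorting: order k-mers by descending count, breaking ties alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

In a distributed setting the increments in `count_kmers` become updates to a shared table, which is one source of the irregular, asynchronous communication the abstract describes.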

    Phylogenetically Structured Differences in rRNA Gene Sequence Variation among Species of Arbuscular Mycorrhizal Fungi and Their Implications for Sequence Clustering

    Arbuscular mycorrhizal (AM) fungi form mutualisms with plant roots that increase plant growth and shape plant communities. Each AM fungal cell contains a large amount of genetic diversity, but it is unclear if this diversity varies across evolutionary lineages. We found that sequence variation in the nuclear large-subunit (LSU) rRNA gene from 29 isolates representing 21 AM fungal species generally assorted into genus- and species-level clades, with the exception of species of the genera Claroideoglomus and Entrophospora. However, there were significant differences in the levels of sequence variation across the phylogeny and between genera, indicating that it is an evolutionarily constrained trait in AM fungi. These consistent patterns of sequence variation across both phylogenetic and taxonomic groups pose challenges to interpreting operational taxonomic units (OTUs) as approximations of species-level groups of AM fungi. We demonstrate that the OTUs produced by five sequence clustering methods using 97% or equivalent sequence similarity thresholds failed to match the expected species of AM fungi, although OTUs from AbundantOTU, CD-HIT-OTU, and CROP corresponded better to species than did OTUs from mothur or UPARSE. This lack of OTU-to-species correspondence resulted both from sequences of one species being split into multiple OTUs and from sequences of multiple species being lumped into the same OTU. The OTU richness therefore will not reliably correspond to the AM fungal species richness in environmental samples. Conservatively, this error can overestimate species richness by 4-fold or underestimate richness by one-half, and the direction of this error will depend on the genera represented in the sample. Importance: Arbuscular mycorrhizal (AM) fungi form important mutualisms with the roots of most plant species. Individual AM fungi are genetically diverse, but it is unclear whether the level of this diversity differs among evolutionary lineages. We found that the amount of sequence variation in an rRNA gene that is commonly used to identify AM fungal species varied significantly between evolutionary groups that correspond to different genera, with the exception of two genera that are genetically indistinguishable from each other. When we clustered groups of similar sequences into operational taxonomic units (OTUs) using five different clustering methods, these patterns of sequence variation caused the number of OTUs to either over- or underestimate the actual number of AM fungal species, depending on the genus. Our results indicate that OTU-based inferences about AM fungal species composition from environmental sequences can be improved if they take these taxonomically structured patterns of sequence variation into account.

    Towards a systematic study of big data performance and benchmarking

    Big data queries are increasing in complexity and the performance of data analytics is of growing importance. To this end, Big Data on high-performance computing (HPC) infrastructure is becoming a pathway to high-performance data analytics. The state of performance studies on this convergence between Big Data and HPC, however, is limited and ad hoc. A systematic performance study is thus timely and forms the core of this research. This thesis investigates the challenges involved in developing Big Data applications with significant computations and strict latency guarantees on multicore HPC clusters. Three key areas it considers are thread models, affinity, and communication mechanisms. The work on thread models addresses the challenges of exploiting intra-node parallelism on modern multicore chips, while the work on affinity looks at data locality and Non-Uniform Memory Access (NUMA) effects. The work on communication mechanisms investigates the difficulties of Big Data communications. For example, parallel machine learning depends on collective communications, unlike classic scientific simulations, which mostly use neighbor communications. Minimizing this cost while scaling out to higher degrees of parallelism requires non-trivial optimizations, especially when using high-level languages such as Java or Scala. The investigation also includes a discussion on performance implications of different programming models such as dataflow and message passing used in Big Data analytics. The optimizations identified in this research are incorporated in developing the Scalable Parallel Interoperable Data Analytics Library (SPIDAL) in Java, which includes a collection of multidimensional scaling and clustering algorithms optimized to run on HPC clusters. Besides presenting performance optimizations, this thesis explores a novel scheme for characterizing Big Data benchmarks. Fundamentally, a benchmark evaluates a certain performance-related aspect of a given system. For example, HPC benchmarks such as LINPACK and NAS Parallel Benchmark (NPB) evaluate the floating-point operations (flops) per second through a computational workload. The challenge with Big Data workloads is the diversity of their applications, which makes it impossible to classify them along a single dimension. Convergence Diamonds (CDs) is a multifaceted scheme that identifies four dimensions of Big Data workloads. These dimensions are problem architecture, execution, data source and style, and processing view. The performance optimizations together with the richness of CDs provide a systematic guide to developing high-performance Big Data benchmarks, specifically targeting data analytics on large, multicore HPC clusters.